Goto

Collaborating Authors

 loss function


Conformal Risk Control under Non-Monotone Losses: Theory and Finite-Sample Guarantees

Aldirawi, Tareq, Li, Yun, Guo, Wenge

arXiv.org Machine Learning

Conformal risk control (CRC) provides distribution-free guarantees for controlling the expected loss at a user-specified level. Existing theory typically assumes that the loss decreases monotonically with a tuning parameter that governs the size of the prediction set. However, this assumption is often violated in practice, where losses may behave non-monotonically due to competing objectives such as coverage and efficiency. In this paper, we study CRC under non-monotone loss functions when the tuning parameter is selected from a finite grid, a setting commonly arising in thresholding and discretized decision rules. Revisiting a known counterexample, we show that the validity of CRC without monotonicity depends critically on the relationship between the calibration sample size and the grid resolution. In particular, reliable risk control can still be achieved when the calibration sample is sufficiently large relative to the grid size. We establish a finite-sample guarantee for bounded losses over a grid of size $m$, showing that the excess risk above the target level $α$ scales on the order of $\sqrt{\log(m)/n}$, where $n$ is the calibration sample size. A matching lower bound demonstrates that this rate is minimax optimal. We also derive refined guarantees under additional structural conditions, including Lipschitz continuity and monotonicity, and extend the analysis to settings with distribution shift via importance weighting. Numerical experiments on synthetic multilabel classification and real object detection data illustrate the practical implications of non-monotonicity. Methods that explicitly account for finite-sample uncertainty achieve more stable risk control than approaches based on monotonicity transformations, while maintaining competitive prediction set sizes.


Rare Event Analysis via Stochastic Optimal Control

Du, Yuanqi, He, Jiajun, Zhang, Dinghuai, Vanden-Eijnden, Eric, Domingo-Enrich, Carles

arXiv.org Machine Learning

Rare events such as conformational changes in biomolecules, phase transitions, and chemical reactions are central to the behavior of many physical systems, yet they are extremely difficult to study computationally because unbiased simulations seldom produce them. Transition Path Theory (TPT) provides a rigorous statistical framework for analyzing such events: it characterizes the ensemble of reactive trajectories between two designated metastable states (reactant and product), and its central object--the committor function, which gives the probability that the system will next reach the product rather than the reactant--encodes all essential kinetic and thermodynamic information. We introduce a framework that casts committor estimation as a stochastic optimal control (SOC) problem. In this formulation the committor defines a feedback control--proportional to the gradient of its logarithm--that actively steers trajectories toward the reactive region, thereby enabling efficient sampling of reactive paths. To solve the resulting hitting-time control problem we develop two complementary objectives: a direct backpropagation loss and a principled off-policy Value Matching loss, for which we establish first-order optimality guarantees. We further address metastability, which can trap controlled trajectories in intermediate basins, by introducing an alternative sampling process that preserves the reactive current while lowering effective energy barriers. On benchmark systems, the framework yields markedly more accurate committor estimates, reaction rates, and equilibrium constants than existing methods.


Sparse $ε$ insensitive zone bounded asymmetric elastic net support vector machines for pattern classification

Du, Haiyan, Yang, Hu

arXiv.org Machine Learning

Existing support vector machines(SVM) models are sensitive to noise and lack sparsity, which limits their performance. To address these issues, we combine the elastic net loss with a robust loss framework to construct a sparse $\varepsilon$-insensitive bounded asymmetric elastic net loss, and integrate it with SVM to build $\varepsilon$ Insensitive Zone Bounded Asymmetric Elastic Net Loss-based SVM($\varepsilon$-BAEN-SVM). $\varepsilon$-BAEN-SVM is both sparse and robust. Sparsity is proven by showing that samples inside the $\varepsilon$-insensitive band are not support vectors. Robustness is theoretically guaranteed because the influence function is bounded. To solve the non-convex optimization problem, we design a half-quadratic algorithm based on clipping dual coordinate descent. It transforms the problem into a series of weighted subproblems, improving computational efficiency via the $\varepsilon$ parameter. Experiments on simulated and real datasets show that $\varepsilon$-BAEN-SVM outperforms traditional and existing robust SVMs. It balances sparsity and robustness well in noisy environments. Statistical tests confirm its superiority. Under the Gaussian kernel, it achieves better accuracy and noise insensitivity, validating its effectiveness and practical value.


Generalization error bounds for two-layer neural networks with Lipschitz loss function

Nguwi, Jiang Yu, Privault, Nicolas

arXiv.org Machine Learning

We derive generalization error bounds for the training of two-layer neural networks without assuming boundedness of the loss function, using Wasserstein distance estimates on the discrepancy between a probability distribution and its associated empirical measure, together with moment bounds for the associated stochastic gradient method. In the case of independent test data, we obtain a dimension-free rate of order $O(n^{-1/2} )$ on the $n$-sample generalization error, whereas without independence assumption, we derive a bound of order $O(n^{-1 / ( d_{\rm in}+d_{\rm out} )} )$, where $d_{\rm in}$, $d_{\rm out}$ denote input and output dimensions. Our bounds and their coefficients can be explicitly computed prior to the training of the model, and are confirmed by numerical simulations.


Expectation Maximization (EM) Converges for General Agnostic Mixtures

Ghosh, Avishek

arXiv.org Machine Learning

Mixture of linear regression is well studied in statistics and machine learning, where the data points are generated probabilistically using $k$ linear models. Algorithms like Expectation Maximization (EM) may be used to recover the ground truth regressors for this problem. Recently, in \cite{pal2022learning,ghosh_agnostic} the mixed linear regression problem is studied in the agnostic setting, where no generative model on data is assumed. Rather, given a set of data points, the objective is \emph{fit} $k$ lines by minimizing a suitable loss function. It is shown that a modification of EM, namely gradient EM converges exponentially to appropriately defined loss minimizer even in the agnostic setting. In this paper, we study the problem of \emph{fitting} $k$ parametric functions to given set of data points. We adhere to the agnostic setup. However, instead of fitting lines equipped with quadratic loss, we consider any arbitrary parametric function fitting equipped with a strongly convex and smooth loss. This framework encompasses a large class of problems including mixed linear regression (regularized), mixed linear classifiers (mixed logistic regression, mixed Support Vector Machines) and mixed generalized linear regression. We propose and analyze gradient EM for this problem and show that with proper initialization and separation condition, the iterates of gradient EM converge exponentially to appropriately defined population loss minimizers with high probability. This shows the effectiveness of EM type algorithm which converges to \emph{optimal} solution in the non-generative setup beyond mixture of linear regression.


Learning interacting particle systems from unlabeled data

Wei, Viska, Lu, Fei

arXiv.org Machine Learning

Learning the potentials of interacting particle systems is a fundamental task across various scientific disciplines. A major challenge is that unlabeled data collected at discrete time points lack trajectory information due to limitations in data collection methods or privacy constraints. We address this challenge by introducing a trajectory-free self-test loss function that leverages the weak-form stochastic evolution equation of the empirical distribution. The loss function is quadratic in potentials, supporting parametric and nonparametric regression algorithms for robust estimation that scale to large, high-dimensional systems with big data. Systematic numerical tests show that our method outperforms baseline methods that regress on trajectories recovered via label matching, tolerating large observation time steps. We establish the convergence of parametric estimators as the sample size increases, providing a theoretical foundation for the proposed approach.


A Comparative Investigation of Thermodynamic Structure-Informed Neural Networks

Li, Guojie, Hong, Liu

arXiv.org Machine Learning

Physics-informed neural networks (PINNs) offer a unified framework for solving both forward and inverse problems of differential equations, yet their performance and physical consistency strongly depend on how governing laws are incorporated. In this work, we present a systematic comparison of different thermodynamic structure-informed neural networks by incorporating various thermodynamics formulations, including Newtonian, Lagrangian, and Hamiltonian mechanics for conservative systems, as well as the Onsager variational principle and extended irreversible thermodynamics for dissipative systems. Through comprehensive numerical experiments on representative ordinary and partial differential equations, we quantitatively evaluate the impact of these formulations on accuracy, physical consistency, noise robustness, and interpretability. The results show that Newtonian-residual-based PINNs can reconstruct system states but fail to reliably recover key physical and thermodynamic quantities, whereas structure-preserving formulation significantly enhances parameter identification, thermodynamic consistency, and robustness. These findings provide practical guidance for principled design of thermodynamics-consistency model, and lay the groundwork for integrating more general nonequilibrium thermodynamic structures into physics-informed machine learning.


The Conjugate Domain Dichotomy: Exact Risk of M-Estimators under Infinite-Variance Noise in High Dimensions

Agiropoulos, Charalampos

arXiv.org Machine Learning

This paper studies high-dimensional M-estimation in the proportional asymptotic regime (p/n -> gamma > 0) when the noise distribution has infinite variance. For noise with regularly-varying tails of index alpha in (1,2), we establish that the asymptotic behavior of a regularized M-estimator is governed by a single geometric property of the loss function: the boundedness of the domain of its Fenchel conjugate. When this conjugate domain is bounded -- as is the case for the Huber, absolute-value, and quantile loss functions -- the dual variable in the min-max formulation of the estimator is confined, the effective noise reduces to the finite first absolute moment of the noise distribution, and the estimator achieves bounded risk without recourse to external information. When the conjugate domain is unbounded -- as for the squared loss -- the dual variable scales with the noise, the effective noise involves the diverging second moment, and bounded risk can be achieved only through transfer regularization toward an external prior. For the squared-loss class specifically, we derive the exact asymptotic risk via the Convex Gaussian Minimax Theorem under a noise-adapted regularization scaling. The resulting risk converges to a universal floor that is independent of the regularizer, yielding a loss-risk trichotomy: squared-loss estimators without transfer diverge; Huber-loss estimators achieve bounded but non-vanishing risk; transfer-regularized estimators attain the floor.


Minimax Generalized Cross-Entropy

Bondugula, Kartheek, Mazuelas, Santiago, Pérez, Aritz, Liu, Anqi

arXiv.org Machine Learning

Loss functions play a central role in supervised classification. Cross-entropy (CE) is widely used, whereas the mean absolute error (MAE) loss can offer robustness but is difficult to optimize. Interpolating between the CE and MAE losses, generalized cross-entropy (GCE) has recently been introduced to provide a trade-off between optimization difficulty and robustness. Existing formulations of GCE result in a non-convex optimization over classification margins that is prone to underfitting, leading to poor performances with complex datasets. In this paper, we propose a minimax formulation of generalized cross-entropy (MGCE) that results in a convex optimization over classification margins. Moreover, we show that MGCEs can provide an upper bound on the classification error. The proposed bilevel convex optimization can be efficiently implemented using stochastic gradient computed via implicit differentiation. Using benchmark datasets, we show that MGCE achieves strong accuracy, faster convergence, and better calibration, especially in the presence of label noise.


Beyond the Mean: Distribution-Aware Loss Functions for Bimodal Regression

Mohammadi-Seif, Abolfazl, Soares, Carlos, Ribeiro, Rita P., Baeza-Yates, Ricardo

arXiv.org Machine Learning

Despite the strong predictive performance achieved by machine learning models across many application domains, assessing their trustworthiness through reliable estimates of predictive confidence remains a critical challenge. This issue arises in scenarios where the likelihood of error inferred from learned representations follows a bimodal distribution, resulting from the coexistence of confident and ambiguous predictions. Standard regression approaches often struggle to adequately express this predictive uncertainty, as they implicitly assume unimodal Gaussian noise, leading to mean-collapse behavior in such settings. Although Mixture Density Networks (MDNs) can represent different distributions, they suffer from severe optimization instability. We propose a family of distribution-aware loss functions integrating normalized RMSE with Wasserstein and Cramér distances. When applied to standard deep regression models, our approach recovers bimodal distributions without the volatility of mixture models. Validated across four experimental stages, our results show that the proposed Wasserstein loss establishes a new Pareto efficiency frontier: matching the stability of standard regression losses like MSE in unimodal tasks while reducing Jensen-Shannon Divergence by 45% on complex bimodal datasets. Our framework strictly dominates MDNs in both fidelity and robustness, offering a reliable tool for aleatoric uncertainty estimation in trustworthy AI systems.